One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson
(Submitted on 11 Dec 2013 (v1), last revised 4 Mar 2014 (this version, v3))

We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline.
The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
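
The two figures quoted in the abstract (35% lower perplexity, about 10% lower cross-entropy) are related through the identity cross-entropy = log2(perplexity). The short check below is only a sketch of that arithmetic: the baseline value 67.6 and the 35% figure come from the abstract, while the function and variable names are ours.

```python
import math

def cross_entropy_bits(perplexity):
    """Cross-entropy in bits per word, using H = log2(perplexity)."""
    return math.log2(perplexity)

baseline_ppl = 67.6                        # unpruned Kneser-Ney 5-gram (from the abstract)
combined_ppl = baseline_ppl * (1 - 0.35)   # 35% lower perplexity (from the abstract)

h_base = cross_entropy_bits(baseline_ppl)
h_comb = cross_entropy_bits(combined_ppl)
relative_drop = (h_base - h_comb) / h_base

print(f"{h_base:.2f} bits -> {h_comb:.2f} bits ({relative_drop:.1%} lower)")
```

Because the logarithm compresses ratios, a large relative drop in perplexity corresponds to a much smaller relative drop in bits per word.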

Comments: Accompanied by a code.google.com project allowing anyone to generate the benchmark data, and use it to compare their language model against the ones described in the paper
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:1312.3005 [cs.CL]
(or arXiv:1312.3005v3 [cs.CL] for this version)

Language Model on One Billion Word Benchmark

Authors:

Oriol Vinyals (vinyals@google.com, github: OriolVinyals),
Xin Pan

Paper Authors:

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu

TL;DR

This is a model pretrained on the One Billion Word Benchmark.
If you use this model in your publication, please cite the original paper:

@article{jozefowicz2016exploring,
title={Exploring the Limits of Language Modeling},
author={Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike
and Shazeer, Noam and Wu, Yonghui},
journal={arXiv preprint arXiv:1602.02410},
year={2016}
}

Introduction

In this release, we open source a model trained on the One Billion Word
Benchmark (http://arxiv.org/abs/1312.3005), a large language corpus in English
which was released in 2013. This dataset contains about one billion words, and
has a vocabulary size of about 800K words. It contains mostly news data. Since
sentences in the training set are shuffled, models can ignore cross-sentence
context and focus on sentence-level language modeling.

In the original release and in subsequent work, models trained on this dataset
have been evaluated on the same test set, making it a standard benchmark for
language modeling. Recently, we wrote an article
(http://arxiv.org/abs/1602.02410) describing a hybrid model that combines a
character CNN, a large and deep LSTM, and a specific Softmax architecture. This
allowed us to train the best model on this dataset thus far, almost halving the
best perplexity previously obtained by others.

Code Release

The open-sourced components include:

TensorFlow GraphDef protocol buffer text file.
TensorFlow pre-trained checkpoint shards.
Code used to evaluate the pre-trained model.
Vocabulary file.
Test set from LM-1B evaluation.

The code supports 4 evaluation modes:

Given a provided dataset, calculate the model's perplexity.
Given a prefix sentence, predict the next words.
Dump the softmax embedding and the character-level CNN word embeddings.
Given a sentence, dump the embedding from the LSTM state.
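
The second mode (predicting the next words after a prefix) can be pictured as repeatedly drawing a word from a softmax distribution and appending it to the prefix. The sketch below illustrates that loop only; `toy_next_word_probs` is a hypothetical stand-in for the real model, the tiny vocabulary is invented, and the sampling strategy (inverting the cumulative distribution) is an assumption rather than a description of the released code.

```python
import numpy as np

def sample_from_softmax(probs, rng):
    """Draw one index by inverting the cumulative distribution."""
    idx = int(np.searchsorted(np.cumsum(probs), rng.random(), side="right"))
    return min(idx, len(probs) - 1)  # guard against rounding in the cumsum

def toy_next_word_probs(words, vocab):
    """Hypothetical model stand-in: a fixed distribution per prefix length."""
    logits = np.array(
        [((len(words) + 1) * (i + 1)) % 5 for i in range(len(vocab))],
        dtype=float,
    )
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def extend_prefix(prefix, vocab, steps, seed=0):
    """Append `steps` sampled words to the whitespace-tokenized prefix."""
    rng = np.random.default_rng(seed)
    words = prefix.split()
    for _ in range(steps):
        probs = toy_next_word_probs(words, vocab)
        words.append(vocab[sample_from_softmax(probs, rng)])
    return " ".join(words)

vocab = ["find", "that", "amazing", "love", "."]  # invented toy vocabulary
print(extend_prefix("I love that I", vocab, steps=3))
```

In the real model the distribution would come from the LSTM and softmax layers rather than from a fixed formula, and the vocabulary would be the one in the released vocabulary file.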

Results

Model	Test Perplexity	Number of Params [billions]
Sigmoid-RNN-2048 [Blackout]	68.3	4.1
Interpolated KN 5-gram, 1.1B n-grams [chelba2013one]	67.6	1.76
Sparse Non-Negative Matrix LM [shazeer2015sparse]	52.9	33
RNN-1024 + MaxEnt 9-gram features [chelba2013one]	51.3	20
LSTM-512-512	54.1	0.82
LSTM-1024-512	48.2	0.82
LSTM-2048-512	43.7	0.83
LSTM-8192-2048 (No Dropout)	37.9	3.3
LSTM-8192-2048 (50% Dropout)	32.2	3.3
2-Layer LSTM-8192-1024 (BIG LSTM)	30.6	1.8
(THIS RELEASE) BIG LSTM+CNN Inputs	30.0	1.04
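
To put the table in relative terms, the snippet below computes how much lower the released model's test perplexity is than a few other rows. The perplexities are copied from the table; the choice of rows to compare against and the helper name `relative_reduction` are ours.

```python
# Test perplexities copied from the results table.
test_perplexity = {
    "Interpolated KN 5-gram": 67.6,
    "RNN-1024 + MaxEnt 9-gram features": 51.3,
    "BIG LSTM": 30.6,
    "BIG LSTM+CNN Inputs": 30.0,
}

def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline

released = test_perplexity["BIG LSTM+CNN Inputs"]
for name, ppl in test_perplexity.items():
    if name != "BIG LSTM+CNN Inputs":
        print(f"{name}: {relative_reduction(ppl, released):.1%} lower")
```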

How To Run

Prerequisites:

Install TensorFlow.
Install Bazel.
Download the data files:
- Model GraphDef file: link
- Model Checkpoint sharded file: 1 2 3 4 5 6 7 8 9 10 11 12
- Vocabulary file: link
- Test dataset: link
It is recommended to run on a modern desktop instead of a laptop.

# 1. Clone the code to your workspace.
# 2. Download the data to your workspace.
# 3. Create an empty WORKSPACE file in your workspace.
# 4. Create an empty output directory in your workspace.
# Example directory structure below:
$ ls -R
.:
data  lm_1b  output  WORKSPACE

./data:
ckpt-base            ckpt-lstm      ckpt-softmax1  ckpt-softmax3  ckpt-softmax5
ckpt-softmax7  graph-2016-09-10.pbtxt          vocab-2016-09-10.txt
ckpt-char-embedding  ckpt-softmax0  ckpt-softmax2  ckpt-softmax4  ckpt-softmax6
ckpt-softmax8  news.en.heldout-00000-of-00050

./lm_1b:
BUILD  data_utils.py  lm_1b_eval.py  README.md

./output:

# Build the code.
$ bazel build -c opt lm_1b/...
# Run sample mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode sample \
                             --prefix "I love that I" \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt  \
                             --ckpt 'data/ckpt-*'
...(omitted some TensorFlow output)
I love
I love that
I love that I
I love that I find
I love that I find that
I love that I find that amazing
...(omitted)

# Run eval mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode eval \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt  \
                             --input_data data/news.en.heldout-00000-of-00050 \
                             --ckpt 'data/ckpt-*'
...(omitted some TensorFlow output)
Loaded step 14108582.
# perplexity is high initially because words without context are harder to
# predict.
Eval Step: 0, Average Perplexity: 2045.512297.
Eval Step: 1, Average Perplexity: 229.478699.
Eval Step: 2, Average Perplexity: 208.116787.
Eval Step: 3, Average Perplexity: 338.870601.
Eval Step: 4, Average Perplexity: 228.950107.
Eval Step: 5, Average Perplexity: 197.685857.
Eval Step: 6, Average Perplexity: 156.287063.
Eval Step: 7, Average Perplexity: 124.866189.
Eval Step: 8, Average Perplexity: 147.204975.
Eval Step: 9, Average Perplexity: 90.124864.
Eval Step: 10, Average Perplexity: 59.897914.
Eval Step: 11, Average Perplexity: 42.591137.
...(omitted)
Eval Step: 4529, Average Perplexity: 29.243668.
Eval Step: 4530, Average Perplexity: 29.302362.
Eval Step: 4531, Average Perplexity: 29.285674.
...(omitted. At convergence, it should be around 30.)
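
The shape of the log above (a very high first value that settles near 30) is what a running average would produce. The sketch below assumes the reported number is exp of the mean negative log-likelihood over all words seen so far; the log itself does not state the formula, so treat this as one plausible reading. The batch values in the example are invented for illustration.

```python
import math

def running_perplexity(batches):
    """Yield exp(mean negative log-likelihood so far) after each batch.

    `batches` is an iterable of (sum_neg_log_prob, num_words) pairs,
    with log-probabilities in natural log.
    """
    total_loss, total_words = 0.0, 0
    for loss, num_words in batches:
        total_loss += loss
        total_words += num_words
        yield math.exp(total_loss / total_words)

# Invented numbers: one hard word without context, then easier batches.
example = [(math.log(2000.0), 1), (20 * math.log(30.0), 20), (20 * math.log(28.0), 20)]
for step, ppl in enumerate(running_perplexity(example)):
    print(f"Eval Step: {step}, Average Perplexity: {ppl:.6f}.")
```

Under this reading, early steps are dominated by a handful of hard-to-predict words, and their influence shrinks as more words enter the average.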

# Run dump_emb mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode dump_emb \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt  \
                             --ckpt 'data/ckpt-*' \
                             --save_dir output
...(omitted some TensorFlow output)
Finished softmax weights
Finished word embedding 0/793471
Finished word embedding 1/793471
Finished word embedding 2/793471
...(omitted)
$ ls output/
embeddings_softmax.npy ...

# Run dump_lstm_emb mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode dump_lstm_emb \
                             --pbtxt data/graph-2016-09-10.pbtxt \
                             --vocab_file data/vocab-2016-09-10.txt \
                             --ckpt 'data/ckpt-*' \
                             --sentence "I love who I am ." \
                             --save_dir output
$ ls output/
lstm_emb_step_0.npy  lstm_emb_step_2.npy  lstm_emb_step_4.npy
lstm_emb_step_6.npy  lstm_emb_step_1.npy  lstm_emb_step_3.npy
lstm_emb_step_5.npy
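
The `.npy` files written by the dump modes can be read back with NumPy. The sketch below loads the per-step LSTM files in numeric step order. It is self-contained, so it first writes small stand-in files to a temporary directory; the array shape used there is an assumption, not the shape the model produces, and `load_lstm_steps` is our own helper name.

```python
import glob
import os
import re
import tempfile

import numpy as np

def load_lstm_steps(save_dir):
    """Return the per-step arrays from save_dir, ordered by step index."""
    pattern = re.compile(r"lstm_emb_step_(\d+)\.npy$")
    paths = glob.glob(os.path.join(save_dir, "lstm_emb_step_*.npy"))
    # Sort numerically so that step 10 comes after step 9, not after step 1.
    paths.sort(key=lambda p: int(pattern.search(p).group(1)))
    return [np.load(p) for p in paths]

# Demo with stand-in files; the (1, 4) shape is a placeholder assumption.
with tempfile.TemporaryDirectory() as tmp:
    for step in range(7):
        fake = np.full((1, 4), step, dtype=np.float32)
        np.save(os.path.join(tmp, f"lstm_emb_step_{step}.npy"), fake)
    steps = load_lstm_steps(tmp)
    print(len(steps), steps[0].shape)
```

To read the real output, call `load_lstm_steps("output")` after running the dump_lstm_emb mode shown above.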
